Parallelizing general histogram application for CUDA architectures

Abstract

Histogramming is a commonly used tool in data analysis. Although the serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, global memory use, and the like. In this paper we compare two approaches to implementing general-purpose histogramming on GPUs. The first algorithm keeps a private copy of the bin counters in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then search for upper bounds according to the bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can support only a limited number of bins. On the other hand, the sort-search strategy has about 50% higher speedup than privatization when the input consists of characters, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
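
The two strategies named in the abstract can be illustrated with a minimal CPU-side sketch in C++. This is not the paper's implementation: the function names `histogram_privatized` and `histogram_sort_search`, the parameters `lo`, `width`, and `num_blocks`, and the restriction to integer inputs with equal-width bins are all assumptions made for illustration. A sequential loop over chunks stands in for CUDA thread blocks with shared-memory counters, and `std::sort` and `std::upper_bound` stand in for the corresponding Thrust calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Privatization (CPU analog): each "block" fills its own private copy of the
// bin counters, and the private copies are merged at the end. On a GPU the
// private copies would live in shared memory and be updated atomically.
// Values outside [lo, lo + num_bins * width) are ignored (an assumption).
std::vector<std::size_t> histogram_privatized(const std::vector<int>& data,
                                              int lo, int width,
                                              std::size_t num_bins,
                                              std::size_t num_blocks) {
    assert(width > 0 && num_bins > 0 && num_blocks > 0);
    std::vector<std::vector<std::size_t>> priv(
        num_blocks, std::vector<std::size_t>(num_bins, 0));
    const std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;

    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = std::min(b * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            if (data[i] < lo) continue;
            const std::size_t bin =
                static_cast<std::size_t>((data[i] - lo) / width);
            if (bin < num_bins) ++priv[b][bin];
        }
    }

    // Merge step: sum the private counters into the final histogram.
    std::vector<std::size_t> result(num_bins, 0);
    for (std::size_t b = 0; b < num_blocks; ++b)
        for (std::size_t k = 0; k < num_bins; ++k)
            result[k] += priv[b][k];
    return result;
}

// Sort-search: sort the input, find for each bin the position just past its
// largest value, and take differences of consecutive positions.
// `data` is taken by value so the caller's input is left unsorted.
std::vector<std::size_t> histogram_sort_search(std::vector<int> data,
                                               int lo, int width,
                                               std::size_t num_bins) {
    assert(width > 0 && num_bins > 0);
    std::sort(data.begin(), data.end());

    // Elements below `lo` belong to no bin; start counting after them.
    std::size_t prev = static_cast<std::size_t>(
        std::lower_bound(data.begin(), data.end(), lo) - data.begin());

    std::vector<std::size_t> result(num_bins, 0);
    for (std::size_t k = 0; k < num_bins; ++k) {
        // Largest integer value that still falls into bin k.
        const int last_value = lo + static_cast<int>(k + 1) * width - 1;
        const std::size_t cumulative = static_cast<std::size_t>(
            std::upper_bound(data.begin(), data.end(), last_value) -
            data.begin());
        result[k] = cumulative - prev;
        prev = cumulative;
    }
    return result;
}
```

In this sketch the privatized version needs `num_blocks * num_bins` counters, which hints at why a shared-memory variant can support only a limited number of bins, whereas the sort-search version needs storage only for the sorted input and one result per bin. For the same input and bin layout, both functions are expected to return the same counts.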